Forward Step Functions Tags with logs to the backend #618

kimi-p · 2022-11-10T16:17:41Z

What does this PR do?

Fetches Step Functions tags from the AWS resourcegroupstaggingapi API and attaches them to logs.
Adds an integration test for Step Functions.
Uses the same S3 caching strategy as Lambda tags.
Updates the CloudFormation template and adds the DD_FETCH_STEP_FUNCTIONS_TAGS flag, which defaults to true.
Adds a DEPLOY_TO_SERVERLESS_SANDBOX env var flag so that installation_test.sh can deploy to the Serverless sandbox account.
Creates two S3 buckets in the Serverless sandbox account: datadog-cloudformation-template-serverless-sandbox and dd-lambda-signing-bucket-serverless-sandbox.
The issue of CloudWatch logs not using the S3 cache at all is not addressed in this PR. I will discuss it with the team and open another PR to fix it. Hopefully this PR and the fix for CW tags caching can be released together once that work is done.
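
For readers unfamiliar with the tagging API, the fetch step can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name, the "states" filter value, and the page size are assumptions, and the stub classes exist only so the loop can be run without AWS credentials. In real use the client would come from boto3.client("resourcegroupstaggingapi").

```python
from typing import Any, Dict, List


def fetch_state_machine_tags(tagging_client: Any) -> Dict[str, List[Dict[str, str]]]:
    """Collect raw AWS tags for Step Functions resources, keyed by ARN.

    `tagging_client` is expected to behave like
    boto3.client("resourcegroupstaggingapi"); it is injected so the
    function can be exercised with a stub.
    """
    tags_by_arn: Dict[str, List[Dict[str, str]]] = {}
    paginator = tagging_client.get_paginator("get_resources")
    # Assumption: "states" (the Step Functions service prefix) as the filter
    # and 100 resources per page; the PR may use different values.
    for page in paginator.paginate(ResourceTypeFilters=["states"], ResourcesPerPage=100):
        for mapping in page.get("ResourceTagMappingList", []):
            tags_by_arn[mapping["ResourceARN"]] = mapping.get("Tags", [])
    return tags_by_arn


# Hypothetical stand-ins for the AWS client, used only to run the loop offline.
class StubPaginator:
    def __init__(self, pages):
        self._pages = pages

    def paginate(self, **kwargs):
        return iter(self._pages)


class StubTaggingClient:
    def __init__(self, pages):
        self._pages = pages

    def get_paginator(self, operation_name):
        assert operation_name == "get_resources"
        return StubPaginator(self._pages)
```

Passing the client in keeps the AWS call at the edge, so the pagination logic can be unit tested without patching boto3.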

Motivation

The logs-to-traces project builds Step Functions traces from AWS logs. To get the env tag, we'd like to fetch Step Functions tags in the forwarder and send them with logs to the logs intake. The logs-to-traces reducer will then pick up these tags and label traces with the correct env.
https://datadoghq.atlassian.net/browse/SLS-2718

Testing Guidelines

Tested in the Serverless sandbox account by running ./installation_test.sh with the stack deletion line commented out.

Logs are tagged with env:staging123 and kimi_test:kimi-test (these tags are set on the test state machine).

The JSON file used for caching also works as expected.

Verified that CloudWatch logs are also attached

Additional Notes

resourcegroupstaggingapi API doc

Testing Forwarder Logs

{
    "PaginationToken": "",
    "ResourceTagMappingList": [
        {
            "ResourceARN": "arn:aws:states:sa-east-1:425362996713:stateMachine:logs-to-traces-complicated-state-machine",
            "Tags": [
                {
                    "Key": "KIMI_TEST",
                    "Value": "kimi-test"
                },
                {
                    "Key": "ENV",
                    "Value": "staging123"
                }
            ]
        }
    ],
    "ResponseMetadata": {
        "RequestId": "333d6c49-0508-44d5-9a43-8940e28b0554",
        "HTTPStatusCode": 200,
        "HTTPHeaders": {
            "x-amzn-requestid": "333d6c49-0508-44d5-9a43-8940e28b0554",
            "content-type": "application/x-amz-json-1.1",
            "content-length": "243",
            "date": "Tue, 15 Nov 2022 16:24:54 GMT"
        },
        "RetryAttempts": 0
    }
}
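
As a rough illustration of how a response like the one above maps to the tags seen on the logs (env:staging123 and kimi_test:kimi-test), the sketch below flattens each Key/Value pair into a lowercased "key:value" string. The function name is hypothetical, and lowercasing is the only normalization applied here; the forwarder's real sanitization may do more.

```python
from typing import Dict, List


def to_datadog_tags(get_resources_page: dict) -> Dict[str, List[str]]:
    """Map each resource ARN in a get_resources page to "key:value" strings."""
    tags_by_arn: Dict[str, List[str]] = {}
    for mapping in get_resources_page.get("ResourceTagMappingList", []):
        tags_by_arn[mapping["ResourceARN"]] = [
            # Assumption: lowercase only; other sanitization is omitted.
            f"{tag['Key']}:{tag['Value']}".lower()
            for tag in mapping.get("Tags", [])
        ]
    return tags_by_arn


# Trimmed copy of the response shown above.
sample_page = {
    "ResourceTagMappingList": [
        {
            "ResourceARN": "arn:aws:states:sa-east-1:425362996713:stateMachine:logs-to-traces-complicated-state-machine",
            "Tags": [
                {"Key": "KIMI_TEST", "Value": "kimi-test"},
                {"Key": "ENV", "Value": "staging123"},
            ],
        }
    ]
}

datadog_tags = to_datadog_tags(sample_page)
```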

Types of changes

Bug fix
New feature
Breaking change
Misc (docs, refactoring, dependency upgrade, etc.)

Check all that apply

This PR's description is comprehensive
This PR contains breaking changes that are documented in the description
This PR introduces new APIs or parameters that are documented and unlikely to change in the foreseeable future
This PR impacts documentation, and it has been updated (or a ticket has been logged)
This PR's changes are covered by the automated tests
This PR collects user input/sensitive content into Datadog
This PR passes the integration tests (ask a Datadog member to run the tests)
This PR passes the unit tests
This PR passes the installation tests (ask a Datadog member to run the tests)

kimi-p · 2022-11-16T16:15:09Z

Verified again after the scripts were updated.
Both the env and kimi_test tags show up correctly in this trace.

DarcyRaynerDD

LGTM overall; I left a few questions.

DarcyRaynerDD · 2022-11-17T15:07:38Z

aws/logs_monitoring/cache.py

+#######################
+
+
+class StepFunctionsTagsCache(LambdaTagsCache):


Now that we have three caches, it might be a good idea to split these Cache classes out into their own files.

DarcyRaynerDD · 2022-11-17T15:11:42Z

aws/logs_monitoring/cache.py

+        get_resources_paginator = resource_tagging_client.get_paginator("get_resources")
+
+        try:
+            for page in get_resources_paginator.paginate(


Just so I understand: we are following the Lambda approach of prefetching the tags for each Step Function, not the CloudWatch log group approach, which doesn't fetch anything?

nine5two7 · 2022-11-18T13:08:57Z

aws/logs_monitoring/cache.py

+        Returns:
+            state_machine_tags (List[str]): the list of "key:value" Datadog tag strings
+        """
+        if self._is_expired():


Just curious: for a specific execution_arn, do its tags keep changing? What kind of change can they have? Or, once its tags are non-null, do they never change?

Tags are cached for 300 seconds. Every 300 seconds, the forwarder refetches all tags for state machines. Does this answer your question?

@kimi-p My concern is not about the TTL. I am just curious about what kind of change can happen to these cached tags. An expired item in the cache can still be usable if it has not changed at all. For a state machine, its tags can change over time across executions, while for a specific execution of a state machine (execution_arn), I am wondering what kind of change can happen to its tags.

The tags are actually on the state machines. So unless the tags on these state machines change, the SF logs' tags won't change. The TTL of 300 seconds makes sure the tags stay reasonably fresh. I'm not sure if I answered your question; we can talk about it in standup.
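
The TTL behavior discussed in this thread can be sketched as below. This is a simplified, in-memory illustration under stated assumptions: the class name, constructor arguments, and injectable clock are hypothetical, and the S3-backed persistence used by the forwarder's caches is not reproduced.

```python
import time
from typing import Callable, Dict, List, Optional


class TtlTagsCache:
    """Refetch every state machine's tags once the cached copy is older than the TTL."""

    def __init__(
        self,
        fetch: Callable[[], Dict[str, List[str]]],
        ttl_seconds: float = 300,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # returns {state_machine_arn: ["key:value", ...]}
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can control time
        self._tags: Dict[str, List[str]] = {}
        self._fetched_at: Optional[float] = None

    def _is_expired(self) -> bool:
        if self._fetched_at is None:
            return True
        return self._clock() - self._fetched_at >= self._ttl

    def get(self, state_machine_arn: str) -> List[str]:
        # Tags live on the state machine, so every execution of the same
        # state machine gets the same tags until the next refresh.
        if self._is_expired():
            self._tags = self._fetch()
            self._fetched_at = self._clock()
        return self._tags.get(state_machine_arn, [])
```

Because the whole tag set is refreshed at once, a tag change on a state machine becomes visible on its logs within one TTL window at most.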

nine5two7

LGTM

kimi-p · 2022-11-18T20:05:23Z

Latest changes are:

Break cache.py into 4 files (base, Lambda, CloudWatch log group, Step Functions).
Update .dockerignore to accept all .py files that sit in the same directory.

kimi-p · 2022-11-21T15:16:36Z

I have verified that after the refactor, logs are forwarded and their tags sent correctly.

kimi-p added 2 commits November 10, 2022 11:16

Add Step Functions Tags to logs

68634ef

fix

bd2d818

github-actions bot added the aws label Nov 10, 2022

kimi-p added 5 commits November 10, 2022 11:29

lint

e55fe86

lint

26e6339

lint

0240174

lint

17f5ec6

lint

67586c3

agocs reviewed Nov 10, 2022

kimi-p added 10 commits November 10, 2022 15:15

update

1662f21

add SF tags cache

41a6b50

use cache and liint

f8e6a06

update

1fcf9c3

lint

76121cc

update_var_name

8986716

add step functions snapshot

2ba9aaa

update

310efca

update parsing.py and cache.py

cb7e2f4

update parsing.py and cache.py

f2ac5d9

kimi-p marked this pull request as ready for review November 16, 2022 16:09

kimi-p changed the title ~~Add Step Functions Tags to logs~~ Forward Step Functions Tags with logs to the backend Nov 16, 2022

update scripts

440847b

kimi-p assigned agocs and DarcyRaynerDD and unassigned agocs and DarcyRaynerDD Nov 17, 2022

DarcyRaynerDD approved these changes Nov 17, 2022

nine5two7 reviewed Nov 18, 2022

nine5two7 approved these changes Nov 18, 2022

kimi-p added 6 commits November 18, 2022 11:04

break cache.py into 4 files, fix a lot of patches in unittests

ac99a00

merge latest change 0167 commit in

37150a4

update import path due to project root

5526f67

update dockerignore

28194ea

update dockerignore to allow all same level py files

ea1ebc5

remove feature flag in docker so that integration test will run w/o h…

aab8e12

…itting aws api

explicitly set DD_FETCH_STEP_FUNCTIONS_TAGS to false

7c263bc

kimi-p requested review from DarcyRaynerDD and nine5two7 November 18, 2022 20:15

kimi-p merged commit ecf1278 into master Nov 21, 2022

kimi-p deleted the kimi.add_sf_tags branch November 21, 2022 18:20

kimi-p mentioned this pull request May 7, 2024

breaking: Use sha256 to hash lambda traceId that are triggered by Step Functions and set _dd.p.tid DataDog/datadog-lambda-js#534

Merged

11 tasks
